
These rare, giant millipedes only exist in Florida

Popular Science

When a graduate student found a baby Florida scrub millipede, she put it in a kiddie pool. Then it got busy reproducing. While Florida is perhaps best known for its beaches and wetlands, its landscape hosts other notable features: ridges. Millions of years ago, sea levels were higher than they are today, and these elevated areas of land became like islands.


'People thought I was a communist doing this as a non-profit': is Wikipedia's Jimmy Wales the last decent tech baron?

The Guardian

'People thought I was a communist doing this as a non-profit': is Wikipedia's Jimmy Wales the last decent tech baron? In an online landscape characterised by doom and division, the people's encyclopedia stands out - a huge collective endeavour giving everyone free access to the sum of human knowledge. But with Elon Musk branding it 'Wokipedia' and AI looming large, can it survive? Wikipedia will be 25 years old in January. Jimmy Wales's daughter will be 25 and three weeks. It's not a coincidence: on Boxing Day 2000 Wales's then wife, Christine, gave birth to a baby girl, but it quickly became clear that something wasn't right. She had breathed in contaminated amniotic fluid, resulting in a life-threatening condition called meconium aspiration syndrome. An experimental treatment was available at the hospital near where they lived in San Diego. Did they want to try it?


Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering

Calvo-Bartolomé, Lorena, Aldana, Valérie, Cantarero, Karla, de Mesa, Alonso Madroñal, Arenas-García, Jerónimo, Boyd-Graber, Jordan

arXiv.org Artificial Intelligence

Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as "What is jaundice?", while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., "Who assists in childbirth?") that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.
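The core idea of data-level discrepancy detection can be illustrated with a minimal sketch, not the authors' MIND pipeline: given pairs of aligned answers (here assumed already translated into one language), flag pairs whose similarity falls below a threshold for human review. The similarity measure, threshold, and example answers are all illustrative assumptions.

```python
# Minimal sketch of data-level discrepancy flagging (illustrative, not MIND):
# compare aligned answer pairs by token overlap and flag divergent ones.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two answers."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def flag_discrepancies(pairs, threshold=0.6):
    """Return indices of answer pairs that diverge enough to need review."""
    return [i for i, (a, b) in enumerate(pairs) if jaccard(a, b) < threshold]

pairs = [
    # Objective question: answers agree across languages.
    ("Jaundice is yellowing of the skin caused by excess bilirubin.",
     "Jaundice is yellowing of the skin caused by excess bilirubin."),
    # Culturally variable question: answers diverge by region.
    ("A midwife usually assists in childbirth.",
     "A doctor or obstetrician usually assists in childbirth."),
]
print(flag_discrepancies(pairs))  # [1]
```

A real pipeline would use cross-lingual embeddings or entailment rather than token overlap, but the user-in-the-loop shape is the same: automatic flagging, human adjudication.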


Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification?

Richter, Cedric, Wehrheim, Heike

arXiv.org Artificial Intelligence

Automatic software verifiers have become increasingly effective at the task of checking software against (formal) specifications. Yet, their adoption in practice has been hampered by the lack of such specifications in real-world code. Large Language Models (LLMs) have shown promise in inferring formal postconditions from natural language hints embedded in code such as function names, comments or documentation. Using the generated postconditions as specifications in a subsequent verification, however, often leads verifiers to suggest invalid inputs, hinting at potential issues that ultimately turn out to be false alarms. To address this, we revisit the problem of specification inference from natural language in the context of automatic software verification. In the process, we introduce NL2Contract, the task of employing LLMs to translate informal natural language into formal functional contracts, consisting of postconditions as well as preconditions. We introduce metrics to validate and compare different NL2Contract approaches, using soundness, the bug-discriminative power of the generated contracts, and their usability in the context of automatic software verification as key metrics. We evaluate NL2Contract with different LLMs and compare it to the task of postcondition generation (nl2postcond). Our evaluation shows that (1) LLMs are generally effective at generating functional contracts that are sound for all possible inputs, (2) the generated contracts are sufficiently expressive for discriminating buggy from correct behavior, and (3) verifiers supplied with LLM-inferred functional contracts produce fewer false alarms than when provided with postconditions alone. Further investigations show that LLM-inferred preconditions generally align well with developers' intentions, which allows us to use automatic software verifiers to catch real-world bugs.
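What a functional contract adds over a bare postcondition can be shown with a small hedged sketch; the function, the contract predicates, and the checking wrapper below are illustrative assumptions, not output of the authors' pipeline. The precondition rules out inputs the function was never meant to handle, which is exactly what suppresses the false alarms a postcondition alone would produce.

```python
# Illustrative functional contract (precondition + postcondition) for a
# hypothetical integer-square-root function.

def integer_sqrt(n: int) -> int:
    """Largest r with r*r <= n."""
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r

def pre(n) -> bool:
    # Precondition inferred from the name/docstring: a non-negative integer.
    return isinstance(n, int) and n >= 0

def post(n: int, r: int) -> bool:
    # Postcondition: r is the floor of the square root of n.
    return r * r <= n < (r + 1) * (r + 1)

def verified(n):
    assert pre(n), "precondition violated: input outside the contract"
    r = integer_sqrt(n)
    assert post(n, r), "postcondition violated: implementation bug"
    return r

print(verified(10))  # 3
```

With only the postcondition, a verifier could report that `integer_sqrt(-1)` returns 0 yet `post(-1, 0)` fails, a spurious "bug"; the precondition marks that input as out of scope.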


WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Li, Xiaoxi, Jin, Jiajie, Dong, Guanting, Qian, Hongjin, Wu, Yongkang, Wen, Ji-Rong, Zhu, Yutao, Dou, Zhicheng

arXiv.org Artificial Intelligence

Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate among web pages, and draft reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.
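The interleaved reasoning-retrieval-drafting control flow described above can be caricatured in a few lines; the loop, the stub `model`, and the stub `search` below are placeholders of my own, not the WebThinker implementation.

```python
# Toy think-search-draft loop (illustrative only): the model either asks for
# a search to fill a knowledge gap or drafts the final text.

def run_agent(question, model, search, max_steps=5):
    """Alternate reasoning with retrieval until the model emits a draft."""
    context = [question]
    for _ in range(max_steps):
        action, payload = model(context)  # ("search", query) or ("draft", text)
        if action == "draft":
            return payload                # final report text
        context.append(search(payload))   # add retrieved evidence, continue
    return None                           # step budget exhausted

# Stub components standing in for the LRM and the web explorer.
def toy_model(context):
    if len(context) == 1:
        return ("search", context[0])
    return ("draft", "Answer based on: " + context[-1])

def toy_search(query):
    return "results for " + query

print(run_agent("capital of France?", toy_model, toy_search))
```

The real system folds this loop into the model's own reasoning trace and trains the tool-use policy with iterative online DPO; the sketch only shows the control-flow skeleton.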


Learning to Guarantee Type Correctness in Code Generation through Type-Guided Program Synthesis

Huang, Zhechong, Zhang, Zhao, Ji, Ruyi, Xia, Tingxuan, Zhu, Qihao, Cao, Qinxiang, Sun, Zeyu, Xiong, Yingfei

arXiv.org Artificial Intelligence

Language models have shown remarkable proficiency in code generation; nevertheless, ensuring type correctness remains a challenge. Although traditional methods, such as constrained decoding, alleviate this problem by externally rejecting untypable code, the model itself does not effectively learn type reasoning internally, which ultimately limits its overall performance. This paper introduces TyFlow, a novel system that internalizes type reasoning within code generation to guide the model to learn the type system. The core of our approach is a novel type-guided program synthesis system that maintains an isomorphism between type derivation trees and synthesis derivation trees, enabling a new code representation based on synthesis decision sequences rather than traditional text-based token sequences. By offloading the complexity of type system learning to the representation itself, models can redirect their computational resources toward higher-level program semantics. Our evaluation shows that TyFlow not only eliminates type errors but also significantly improves functional correctness, highlighting the importance of aligning LMs with type systems internally.
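The key representational move, emitting synthesis decisions instead of free-form tokens, can be sketched in miniature; the toy grammar, production names, and decision encoding below are my own illustrative assumptions, not TyFlow's. Because each decision can only select among productions of the current goal type, any (sufficiently long) decision sequence derives a well-typed expression by construction.

```python
# Toy type-guided synthesis (illustrative): the "model" outputs decision
# indices; each index is interpreted relative to the productions available
# for the current goal type, so untypable programs are unrepresentable.

PRODUCTIONS = {
    "int":  [("lit1", lambda d: "1"),
             ("plus", lambda d: f"({d('int')} + {d('int')})")],
    "bool": [("true", lambda d: "True"),
             ("lt",   lambda d: f"({d('int')} < {d('int')})")],
}

def synthesize(goal: str, decisions: list) -> str:
    """Expand the goal type by consuming decision indices left to right."""
    it = iter(decisions)
    def derive(ty: str) -> str:
        options = PRODUCTIONS[ty]            # only type-compatible choices
        _, build = options[next(it) % len(options)]
        return build(derive)
    return derive(goal)

print(synthesize("bool", [1, 1, 0, 0, 0]))  # ((1 + 1) < 1)
```

A text-token model could emit `(1 < True)` and rely on an external checker to reject it; here the decision sequence for goal type `bool` cannot express that term at all, which is the isomorphism between type derivation and synthesis derivation in miniature.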